This project uses R Markdown to examine and visualize data on public school enrollment in the United States from the 1970s to the early 2000s. Acquired from the U.S. Census Bureau, the data include enrollment figures at the national, state, and county levels. Using multiple packages, I manipulate the original data sets by creating variables and functions to reshape, modify, and plot the data provided. The techniques used were drawn from lectures, assignments, and resources provided by ST 558: Data Science for Statisticians at North Carolina State University in Fall 2022.

Data Processing

First Steps

The data set read in below is one section of the public school enrollment data. The readr package is used to read the original comma-delimited .csv file into an object, or data structure, that R can easily use. The resulting object is a tibble named sheet1.

library(readr)
sheet1 <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010187N1, EDU010187N2, EDU010188N1, EDU010188...
## dbl (20): EDU010187F, EDU010187D, EDU010188F, EDU010188D, EDU010189F, EDU010...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
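
As the message above suggests, the column specification output can be silenced. The sketch below is a minimal illustration only: it assumes readr 2.0 or later (where the show_col_types argument is available) and uses a small temporary file with made-up values in place of the course URL. The names tmp and toy_sheet are hypothetical.

```r
library(readr)

# Hypothetical two-row file written to a temporary location (values are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Area_name,STCOU,EDU010187D",
             "UNITED STATES,00000,100",
             "\"Autauga, AL\",01001,10"), tmp)

# show_col_types = FALSE suppresses the column specification message
toy_sheet <- read_csv(tmp, show_col_types = FALSE)
dim(toy_sheet)
```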

Step 1

Running the code chunk below creates a new tibble object named enrollment1 containing only the Area_name column (renamed here as area_name), STCOU, and all columns ending in “D” from sheet1. Loading the tidyverse package makes the pipe operator (%>%) available for chaining.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ dplyr   1.0.9
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.0     ✔ forcats 0.5.2
## ✔ purrr   0.3.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
enrollment1 <- sheet1 %>%
  select(Area_name, STCOU, ends_with("D")) %>% 
  rename("area_name" = Area_name)
enrollment1
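
To see what the selection keeps, the same calls can be run on a tiny stand-in for sheet1. This is only a sketch: the tibble toy and its values are made up, and only the column naming pattern follows the real data.

```r
library(dplyr)

# Hypothetical miniature of sheet1 (made-up values, same naming pattern)
toy <- tibble(
  Area_name  = c("UNITED STATES", "Autauga, AL"),
  STCOU      = c("00000", "01001"),
  EDU010187F = c(0, 0),
  EDU010187D = c(100, 10),
  EDU010188D = c(110, 11)
)

# Keep the identifier columns plus every column whose name ends in "D"
toy_selected <- toy %>%
  select(Area_name, STCOU, ends_with("D")) %>%
  rename(area_name = Area_name)

names(toy_selected)
```

The column ending in “F” is dropped, while the two identifier columns and both “D” columns remain.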

Step 2

Running the code chunk below reshapes the enrollment1 tibble from wide to long format. The resulting object is a long format tibble named enrollment2 where each row, or observation, has only one enrollment value (studentsEnrolled) for the area_name. The column names ending in “D” from enrollment1 are now under the measurementCode variable.

enrollment2 <- enrollment1 %>% 
  pivot_longer(cols = 3:12, names_to = "measurementCode", values_to = "studentsEnrolled")
enrollment2
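
The wide to long reshape can be sketched on a two-row example. The data frame wide and its values are made up for illustration. As a design alternative to the positional cols = 3:12 used above, this sketch selects the columns with ends_with("D"), which does not depend on how many “D” columns a file has.

```r
library(tidyr)

# Hypothetical wide data: one row per area, one column per measurement code
wide <- data.frame(
  area_name  = c("UNITED STATES", "Autauga, AL"),
  STCOU      = c("00000", "01001"),
  EDU010187D = c(100, 10),
  EDU010188D = c(110, 11)
)

# Each "D" column becomes its own row: 2 areas x 2 codes = 4 rows
long <- pivot_longer(
  wide,
  cols      = ends_with("D"),
  names_to  = "measurementCode",
  values_to = "studentsEnrolled"
)
long
```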

Step 3

Running the code chunk below manipulates the enrollment data further by splitting the measurementCode variable into two separate columns. The mutate() function creates the new variables, and the substr() function parses the character strings in enrollment2’s measurementCode into a seven character measurementType and a two character year, stored in the intermediate tibble enrollment3. The year variable is then converted to numeric so that the two digit year can be changed into a four digit schoolYear: values of 22 or less have 2000 added to them, and all other values have 1900 added. All unnecessary columns are then dropped from the new tibble named enrollment4.

enrollment3 <- enrollment2 %>% 
  mutate(measurementType = substr(measurementCode, 1, 7), 
         year = substr(measurementCode, 8, 9)) 

enrollment3$year <- as.numeric(enrollment3$year) 

enrollment4 <- enrollment3 %>%
  mutate(schoolYear = if_else(year <= 22 , year + 2000, year + 1900)) %>% 
  select(area_name, STCOU, measurementType, schoolYear, studentsEnrolled) 
enrollment4
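
The parsing rule can be checked on a few codes in base R. The codes below are illustrative values that follow the pattern of the column names, and ifelse() stands in for dplyr’s if_else().

```r
# Illustrative codes: 7 character type, 2 digit year, trailing "D"
codes <- c("EDU010187D", "EDU010196D", "EDU010200D", "EDU015206D")

measurementType <- substr(codes, 1, 7)          # characters 1 to 7
year <- as.numeric(substr(codes, 8, 9))         # characters 8 to 9, as numbers

# Two digit years of 22 or less are read as 20xx, all others as 19xx
schoolYear <- ifelse(year <= 22, year + 2000, year + 1900)

data.frame(codes, measurementType, year, schoolYear)
```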

Step 4

Running the code chunk below creates a tibble from the enrollment4 data set containing only county data (county) with a class called county. To achieve this for the county level data, the grep() function was used to look through the character strings in area_name to find all the rows with a comma since all county names in the data set are followed by a comma and the two letter state abbreviation.

county <- enrollment4 %>% 
  slice(grep(pattern = ", \\w\\w", area_name)) 
county
class(county) <- c("county", class(county)) 
class(county)
## [1] "county"     "tbl_df"     "tbl"        "data.frame"

Running the code chunk below creates a tibble from the enrollment4 data set containing only non-county data (noncounty) with a class called state. Similar to the process used above, the grepl() function looks through the character strings in area_name, and filter() keeps all the rows that do not contain a comma followed by a two character abbreviation.

noncounty <- enrollment4 %>% 
  filter(!grepl(pattern = ", \\w\\w", area_name)) 
noncounty
class(noncounty) <- c("state", class(noncounty)) 
class(noncounty)
## [1] "state"      "tbl_df"     "tbl"        "data.frame"
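
The split and the added class can be illustrated in base R on a handful of area names. The vector below is a made-up sample, and county_toy is a hypothetical object used only for this sketch.

```r
# Made-up sample of area names in the style of the data set
area_name <- c("UNITED STATES", "ALABAMA", "Autauga, AL",
               "Baldwin, AL", "District of Columbia")

# TRUE where the name contains a comma, a space, and two word characters
is_county <- grepl(", \\w\\w", area_name)
is_county

county_toy <- data.frame(area_name = area_name[is_county])

# Prepending keeps the existing classes, so data frame behavior still works
class(county_toy) <- c("county", class(county_toy))
class(county_toy)
inherits(county_toy, "data.frame")
```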

Step 5

Running the code chunk below alters the existing county tibble to add a new variable for the state associated with the area_name. As discussed previously, all county names in the data set are followed by a comma and the two letter state abbreviation. To create a state variable containing the state abbreviations only, the nchar() function was used to count the number of characters in the string and the substr() function saved only the last two characters.

county <- county %>% 
  mutate(state = substr(area_name, nchar(area_name) - 1, nchar(area_name))) 
county
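
A quick base R check of the last-two-characters rule, using made-up county names. The sub() call is shown only as an alternative sketch that gives the same result for names of this form.

```r
area_name <- c("Autauga, AL", "Cook, IL", "Prince George's, MD")

# Start two characters from the end and read through the final character
state <- substr(area_name, nchar(area_name) - 1, nchar(area_name))
state

# Alternative sketch: drop everything up to and including the last ", "
state_alt <- sub(".*, ", "", area_name)
identical(state, state_alt)
```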

Step 6

The U.S. Census Bureau organizes states into four regions and nine divisions. These divisions are:

  1. New England
  2. Middle Atlantic
  3. East North Central
  4. West North Central
  5. South Atlantic
  6. East South Central
  7. West South Central
  8. Mountain
  9. Pacific

Running the code chunk below alters the existing noncounty tibble to add a new variable for the division associated with the area_name. Using the Census Bureau’s list of states by division, a vector was created for each division and used to populate the division variable by searching the area_name column for any state corresponding with it. Any row with an area_name not corresponding to a division will return ERROR in the new column.

noncounty <- noncounty %>% 
  mutate(division = if_else(area_name %in% c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT"), "New England",
                            if_else(area_name %in% c("NEW JERSEY", "NEW YORK", "PENNSYLVANIA"), "Middle Atlantic",
                                    if_else(area_name %in% c("INDIANA", "ILLINOIS", "MICHIGAN", "OHIO", "WISCONSIN"), "East North Central",
                                            if_else(area_name %in% c("IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA"), "West North Central",
                                                    if_else(area_name %in% c("DELAWARE", "DISTRICT OF COLUMBIA", "District of Columbia", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA", "WEST VIRGINIA"), "South Atlantic",
                                                            if_else(area_name %in% c("ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE"), "East South Central",
                                                                    if_else(area_name %in% c("ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS"), "West South Central",
                                                                            if_else(area_name %in% c("ARIZONA", "COLORADO", "IDAHO", "NEW MEXICO", "MONTANA", "UTAH", "NEVADA", "WYOMING"), "Mountain",
                                                                                    if_else(area_name %in% c("ALASKA", "CALIFORNIA", "HAWAII", "OREGON", "WASHINGTON"), "Pacific", "ERROR"))))))))))
noncounty
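
As an alternative sketch to the nested if_else() calls (not the method used in this project), the same mapping can be expressed as a named lookup vector in base R. Only two divisions are included here for brevity, and division_lookup is a hypothetical name.

```r
# Hypothetical lookup: names are states, values are divisions (abbreviated list)
division_lookup <- c(
  "CONNECTICUT" = "New England",
  "MAINE"       = "New England",
  "NEW JERSEY"  = "Middle Atlantic",
  "NEW YORK"    = "Middle Atlantic"
)

area_name <- c("MAINE", "NEW YORK", "UNITED STATES")

# Indexing by name returns NA for names missing from the lookup
division <- unname(division_lookup[area_name])
division[is.na(division)] <- "ERROR"
division
```

Extending the lookup to all nine divisions only requires adding entries to the vector, without adding another level of nesting.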

Requirements

The data set read in below is a second section of public school enrollment data saved as a new tibble named sheet2.

sheet2 <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010197N1, EDU010197N2, EDU010198N1, EDU010198...
## dbl (20): EDU010197F, EDU010197D, EDU010198F, EDU010198D, EDU010199F, EDU010...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Function for Steps 1 and 2

Running the code chunk below creates a function to execute steps 1 and 2 (executeStep12). The temporary object pivot first selects only Area_name, STCOU, and all columns ending in “D” from the data frame provided by the user and renames Area_name to area_name. The tibble then gets reshaped from wide to long format. The column names ending in “D” from the data frame are now under the measurementCode variable, with one enrollment value in each row. The user running the executeStep12 function has the option to name the enrollment value column (z), but if they choose not to do so, the default name for the column is studentsEnrolled.

executeStep12 <- function(df, z = "studentsEnrolled"){
  pivot <- df %>%
    select(Area_name, STCOU, ends_with("D")) %>%
    rename("area_name" = Area_name) %>%
    pivot_longer(cols = 3:12, names_to = "measurementCode", values_to = z)
  return(pivot)
}
step12 <- executeStep12(sheet2)
step12

Function for Step 3

Running the code chunk below creates a function to execute step 3 (executeStep3). The first temporary object splitCode manipulates the data frame resulting from step 2 (provided by the user) by splitting the measurementCode variable into two separate columns. The mutate() function is used to create the new variables and the substr() function parses the character string values from measurementCode into a seven character measurementType and a two character year. The year variable is then converted to a two digit numeric variable in order to change the two digit year into a four digit schoolYear in the last temporary object fullYear. All unnecessary columns are then dropped from the new tibble.

executeStep3 <- function(df, z = "studentsEnrolled"){
  splitCode <- df %>%
    mutate(measurementType = substr(measurementCode, 1, 7),
           year = substr(measurementCode, 8, 9))
  splitCode$year <- as.numeric(splitCode$year)
  fullYear <- splitCode %>%
    mutate(schoolYear = if_else(year <= 22 , year + 2000, year + 1900)) %>%
    select(area_name, STCOU, measurementType, schoolYear, all_of(z))
  return(fullYear)
}
step3 <- executeStep3(step12)
step3
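
When a column name is held in a character variable such as z, tidyselect’s all_of() makes the selection explicit. The sketch below uses a made-up tibble to show the idea; toy and picked are hypothetical names.

```r
library(dplyr)

# Made-up data with a user-named enrollment column
toy <- tibble(
  area_name  = c("Autauga, AL", "Baldwin, AL"),
  enrollment = c(10, 20),
  extra      = c(1, 2)
)

z <- "enrollment"   # column name supplied as a string

# all_of(z) selects the column whose name is stored in z
picked <- toy %>%
  select(area_name, all_of(z))

names(picked)
```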

Function for Step 5

Running the code chunk below creates a function to execute step 5 (executeStep5). Meant to be used on the county tibble from step 4, the temporary object addState adds the new state variable for the state associated with the area_name by using the nchar() function to count the number of characters in the string and the substr() function to save only the last two characters in the string containing the two letter state abbreviation.

executeStep5 <- function(df){
  addState <- df %>% 
    mutate(state = substr(area_name, nchar(area_name) - 1, nchar(area_name)))
  return(addState)
}

Function for Step 6

Running the code chunk below creates a function to execute step 6 (executeStep6). Meant to be used on the noncounty tibble from step 4, the temporary object addDivision adds a new variable for the division associated with the area_name using vectors containing all the states in each division. The %in% operator populates the division variable by searching the area_name column for any state corresponding with it. Any row with an area_name not corresponding to a division will return ERROR in the new column.

executeStep6 <- function(df){
  addDivision <- df %>% 
    mutate(division = if_else(area_name %in% c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT"), "New England",
                            if_else(area_name %in% c("NEW JERSEY", "NEW YORK", "PENNSYLVANIA"), "Middle Atlantic",
                                    if_else(area_name %in% c("INDIANA", "ILLINOIS", "MICHIGAN", "OHIO", "WISCONSIN"), "East North Central",
                                            if_else(area_name %in% c("IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA"), "West North Central",
                                                    if_else(area_name %in% c("DELAWARE", "DISTRICT OF COLUMBIA", "District of Columbia", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA", "WEST VIRGINIA"), "South Atlantic",
                                                            if_else(area_name %in% c("ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE"), "East South Central",
                                                                    if_else(area_name %in% c("ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS"), "West South Central",
                                                                            if_else(area_name %in% c("ARIZONA", "COLORADO", "IDAHO", "NEW MEXICO", "MONTANA", "UTAH", "NEVADA", "WYOMING"), "Mountain",
                                                                                    if_else(area_name %in% c("ALASKA", "CALIFORNIA", "HAWAII", "OREGON", "WASHINGTON"), "Pacific", "ERROR"))))))))))
  return(addDivision)
}

Function for Steps 4, 5 and 6

Running the code chunk below creates a function to execute steps 4, 5 and 6 (executeStep456).

  • The temporary county object creates a tibble from the data frame resulting from step 3 (provided by the user) containing only county data by keeping all the rows with a comma in area_name. The class() assignment prepends a new class called county to the object’s existing classes. The temporary object countyStep5 runs the previously written executeStep5 function to add the state variable, producing the countyData tibble.
  • The temporary noncounty object creates a second tibble containing only non-county data by keeping all the rows without a comma in area_name. The class() assignment prepends a new class called state to the object’s existing classes. The temporary object noncountyStep6 runs the previously written executeStep6 function to add the division variable, producing the noncountyData tibble.
  • While the function can only return one object, list() allows executeStep456 to return a list containing two separate tibbles (countyData and noncountyData).

executeStep456 <- function(df){
  county <- df %>%
    slice(grep(pattern = ", \\w\\w", area_name))
  class(county) <- c("county", class(county))
  countyStep5 <- executeStep5(county) 
  noncounty <- df %>%
    filter(!grepl(pattern = ", \\w\\w", area_name))  # use df's own area_name column, not the global enrollment4
  class(noncounty) <- c("state", class(noncounty))
  noncountyStep6 <- executeStep6(noncounty)
  return(list(countyData = countyStep5, noncountyData = noncountyStep6))
}
step456 <- executeStep456(step3)
step456
## $countyData
## # A tibble: 31,450 × 6
##    area_name   STCOU measurementType schoolYear studentsEnrolled state
##    <chr>       <chr> <chr>                <dbl>            <dbl> <chr>
##  1 Autauga, AL 01001 EDU0101               1997             8099 AL   
##  2 Autauga, AL 01001 EDU0101               1998             8211 AL   
##  3 Autauga, AL 01001 EDU0101               1999             8489 AL   
##  4 Autauga, AL 01001 EDU0102               2000             8912 AL   
##  5 Autauga, AL 01001 EDU0102               2001             8626 AL   
##  6 Autauga, AL 01001 EDU0102               2002             8762 AL   
##  7 Autauga, AL 01001 EDU0152               2003             9105 AL   
##  8 Autauga, AL 01001 EDU0152               2004             9200 AL   
##  9 Autauga, AL 01001 EDU0152               2005             9559 AL   
## 10 Autauga, AL 01001 EDU0152               2006             9652 AL   
## # … with 31,440 more rows
## 
## $noncountyData
## # A tibble: 530 × 6
##    area_name     STCOU measurementType schoolYear studentsEnrolled division
##    <chr>         <chr> <chr>                <dbl>            <dbl> <chr>   
##  1 UNITED STATES 00000 EDU0101               1997         44534459 ERROR   
##  2 UNITED STATES 00000 EDU0101               1998         46245814 ERROR   
##  3 UNITED STATES 00000 EDU0101               1999         46368903 ERROR   
##  4 UNITED STATES 00000 EDU0102               2000         46818690 ERROR   
##  5 UNITED STATES 00000 EDU0102               2001         47127066 ERROR   
##  6 UNITED STATES 00000 EDU0102               2002         47606570 ERROR   
##  7 UNITED STATES 00000 EDU0152               2003         48506317 ERROR   
##  8 UNITED STATES 00000 EDU0152               2004         48693287 ERROR   
##  9 UNITED STATES 00000 EDU0152               2005         48978555 ERROR   
## 10 UNITED STATES 00000 EDU0152               2006         49140702 ERROR   
## # … with 520 more rows
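
Returning two data sets from one function through a named list can be sketched in base R. The helper split_toy and the data below are hypothetical and only mirror the structure of the value returned by executeStep456.

```r
# Hypothetical helper: split one data frame into two and return both in a list
split_toy <- function(df) {
  is_county <- grepl(", \\w\\w", df$area_name)
  list(countyData    = df[is_county, , drop = FALSE],
       noncountyData = df[!is_county, , drop = FALSE])
}

toy <- data.frame(
  area_name        = c("UNITED STATES", "ALABAMA", "Autauga, AL"),
  studentsEnrolled = c(100, 20, 10)   # made-up values
)

parts <- split_toy(toy)
names(parts)                  # names of the list elements
parts$countyData$area_name    # each element is extracted with $
```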

Wrapper Function for Steps 1, 2, 3, 4, 5 and 6

Running the code chunk below creates a wrapper function to execute data processing steps 1 through 6 (my_wrapper). Provided a URL of a .csv file from the user, the wrapper function will read in the data set in the first temporary object data, call the function to run steps 1 and 2 (executeStep12) in question12, call the function to run step 3 (executeStep3) in question3, and call the function to run the final steps (executeStep456) in question456. Since the wrapper function ends with the executeStep456 function, it returns a list containing two separate tibbles (countyData and noncountyData).

my_wrapper <- function(url, z = "studentsEnrolled"){
  data <- read_csv(url)
  question12 <- executeStep12(data, z)
  question3 <- executeStep3(question12, z)
  question456 <- executeStep456(question3)
  return(question456)
}

Call It and Combine Your Data

First Two Data Sets

Wrapper Function

Running the code chunk below calls the wrapper function my_wrapper to read in and parse the two .csv files for the first and second data sets. The resulting objects (eduA and eduB) are two lists containing two tibbles each (eduA$countyData, eduA$noncountyData, eduB$countyData and eduB$noncountyData).

eduA <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010187N1, EDU010187N2, EDU010188N1, EDU010188...
## dbl (20): EDU010187F, EDU010187D, EDU010188F, EDU010188D, EDU010189F, EDU010...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eduB <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010197N1, EDU010197N2, EDU010198N1, EDU010198...
## dbl (20): EDU010197F, EDU010197D, EDU010198F, EDU010198D, EDU010199F, EDU010...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Write Combine Function

Running the code chunk below creates a function (combineEnrollment) to combine the four tibbles resulting from the wrapper function my_wrapper into two tibbles, merging the two county level data sets (eduA$countyData and eduB$countyData) and the two non-county level data sets (eduA$noncountyData and eduB$noncountyData) using dplyr::bind_rows(). The result is a list containing the two tibbles.

combineEnrollment <- function(df1, df2){
  countyData <- dplyr::bind_rows(df1$countyData, df2$countyData)
  noncountyData <- dplyr::bind_rows(df1$noncountyData, df2$noncountyData)
  return(list(countyData = countyData, noncountyData = noncountyData))
}
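
The stacking done by dplyr::bind_rows() can be seen on two one-row lists. The objects a, b, and combined and their values are made up for this sketch.

```r
library(dplyr)

# Two hypothetical results in the shape returned by my_wrapper (made-up values)
a <- list(
  countyData    = tibble(area_name = "Autauga, AL", schoolYear = 1987, studentsEnrolled = 10),
  noncountyData = tibble(area_name = "ALABAMA", schoolYear = 1987, studentsEnrolled = 100)
)
b <- list(
  countyData    = tibble(area_name = "Autauga, AL", schoolYear = 1997, studentsEnrolled = 12),
  noncountyData = tibble(area_name = "ALABAMA", schoolYear = 1997, studentsEnrolled = 120)
)

# Rows of the second tibble are appended below the rows of the first
combined <- list(
  countyData    = bind_rows(a$countyData, b$countyData),
  noncountyData = bind_rows(a$noncountyData, b$noncountyData)
)
combined$countyData
```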

Call Combine Function

Running the code chunk below calls the combine function combineEnrollment to combine eduA and eduB into one list (eduAB) containing two tibbles corresponding to county level data (countyData) and non-county level data (noncountyData).

eduAB <- combineEnrollment(eduA, eduB)
eduAB
## $countyData
## # A tibble: 62,900 × 6
##    area_name   STCOU measurementType schoolYear studentsEnrolled state
##    <chr>       <chr> <chr>                <dbl>            <dbl> <chr>
##  1 Autauga, AL 01001 EDU0101               1987             6829 AL   
##  2 Autauga, AL 01001 EDU0101               1988             6900 AL   
##  3 Autauga, AL 01001 EDU0101               1989             6920 AL   
##  4 Autauga, AL 01001 EDU0101               1990             6847 AL   
##  5 Autauga, AL 01001 EDU0101               1991             7008 AL   
##  6 Autauga, AL 01001 EDU0101               1992             7137 AL   
##  7 Autauga, AL 01001 EDU0101               1993             7152 AL   
##  8 Autauga, AL 01001 EDU0101               1994             7381 AL   
##  9 Autauga, AL 01001 EDU0101               1995             7568 AL   
## 10 Autauga, AL 01001 EDU0101               1996             7834 AL   
## # … with 62,890 more rows
## 
## $noncountyData
## # A tibble: 1,060 × 6
##    area_name     STCOU measurementType schoolYear studentsEnrolled division
##    <chr>         <chr> <chr>                <dbl>            <dbl> <chr>   
##  1 UNITED STATES 00000 EDU0101               1987         40024299 ERROR   
##  2 UNITED STATES 00000 EDU0101               1988         39967624 ERROR   
##  3 UNITED STATES 00000 EDU0101               1989         40317775 ERROR   
##  4 UNITED STATES 00000 EDU0101               1990         40737600 ERROR   
##  5 UNITED STATES 00000 EDU0101               1991         41385442 ERROR   
##  6 UNITED STATES 00000 EDU0101               1992         42088151 ERROR   
##  7 UNITED STATES 00000 EDU0101               1993         42724710 ERROR   
##  8 UNITED STATES 00000 EDU0101               1994         43369917 ERROR   
##  9 UNITED STATES 00000 EDU0101               1995         43993459 ERROR   
## 10 UNITED STATES 00000 EDU0101               1996         44715737 ERROR   
## # … with 1,050 more rows

Writing a Generic Function for Summarizing

First steps

State Plotting Method

Running the code chunk below writes a custom plot function (plot.state) for the state class added to the non-county level data in step 4. The temporary object avgEnroll first filters out any observations not corresponding to a division, which step 6 marks with ERROR in the division variable. The data are then grouped by division and schoolYear, and a new variable avgEnrollment holds the mean of the enrollment statistic (named studentsEnrolled by default) for each division in each year. Since the name of the enrollment column is supplied by the user as the character string z, the get() function is needed inside mean() to look up the column with that name. The function plots a line graph of the mean of the enrollment statistic for each division per year observed (schoolYear).

plot.state <- function(df, z = "studentsEnrolled"){
   avgEnroll <- df %>%
     filter(division != "ERROR") %>%
     group_by(division, schoolYear) %>%
     summarise(avgEnrollment = mean(get(z)))
   ggplot(avgEnroll, aes(x = schoolYear, y = avgEnrollment, color = division)) +
     geom_line()
}
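
As an alternative sketch to get(z) (not the approach used in this project), dplyr’s .data pronoun can look up a column from a string inside summarise(). The tibble toy and its values are made up.

```r
library(dplyr)

# Made-up division level data with a user-named enrollment column
toy <- tibble(
  division   = c("Pacific", "Pacific", "Mountain", "Mountain"),
  schoolYear = c(1987, 1987, 1987, 1987),
  enrollment = c(10, 30, 5, 15)
)

z <- "enrollment"   # column name supplied as a string

# .data[[z]] refers to the column of the current data named by z
avg <- toy %>%
  group_by(division, schoolYear) %>%
  summarise(avgEnrollment = mean(.data[[z]]), .groups = "drop")
avg
```

Setting .groups = "drop" also avoids the message summarise() prints about grouped output.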

County Plotting Method

Running the code chunk below writes a custom plot function (plot.county) for the county class added to the county level data in step 4.

  • User options were placed in the function’s input for:
    • the enrollment statistic z (named studentsEnrolled by default)
    • the state of interest st (‘IL’ by default)
    • data organization org (‘top’ indicating to organize data from largest to smallest by default)
    • number of counties to plot n (default value of 5).
  • The temporary object filterData first subsets the data to only include observations from the state indicated (st). The data was then grouped by area_name, and a new variable avgEnrollment was added to calculate the mean value of the enrollment statistic for each county.
  • The temporary object orderData sorts the data according to the user’s choice of organization org, keeps the n rows requested by the user, and selects only the resulting area_name column. The result will later be used to subset the original tibble for plotting.
    • If ‘top’ is specified as org by the user, the data frame will be organized from largest to smallest and the n county names with the largest averages are selected.
    • If ‘bottom’ is specified as org by the user, the data frame will be organized from smallest to largest and the n county names with the smallest averages are selected.
    • If something other than ‘top’ or ‘bottom’ is specified as org by the user, the stop() function prints an error message.
  • The temporary object filterOrder subsets the original data frame by orderData to select only observations specified by the user.

The function plots a line graph of the enrollment statistic for each county fitting the user specification across the years.

plot.county <- function(df, z = "studentsEnrolled", st = "IL", org = "top", n = 5){
   filterData <- df %>%
     filter(state == st) %>%
     group_by(area_name) %>%
     summarise(avgEnrollment = mean(get(z)))
   orderData <- if(org == "top") {
     arrange(filterData, desc(avgEnrollment)) %>%
       top_n(n) %>%
       select(area_name)
     } else if(org == "bottom") {
       arrange(filterData, avgEnrollment) %>%
         top_n(-n) %>%  # negative n keeps the n smallest values; top_n() ranks by value, not row order
         select(area_name)
       } else {
         stop("Must specify organizational preference (org)")
         }
   filterOrder <- df[df$area_name %in% orderData$area_name, ] 
   ggplot(filterOrder, aes(x = schoolYear, y = get(z), color = area_name)) +
     geom_line()
}
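
Because top_n() chooses rows by the values in the last column rather than by row order, the direction of the selection has to be stated explicitly. One way to do that, sketched below on made-up data and assuming dplyr 1.0.0 or later, is slice_max() and slice_min(). This is an illustration, not the code used in plot.county.

```r
library(dplyr)

# Made-up county averages
toy <- tibble(
  area_name     = c("A", "B", "C", "D"),
  avgEnrollment = c(40, 10, 30, 20)
)

# The two counties with the largest averages
largest <- toy %>%
  slice_max(avgEnrollment, n = 2) %>%
  pull(area_name)

# The two counties with the smallest averages
smallest <- toy %>%
  slice_min(avgEnrollment, n = 2) %>%
  pull(area_name)

largest
smallest
```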

Put it Together

Data Sets 1 and 2

Wrapper Function

Running the code chunk below calls the wrapper function my_wrapper to read in and parse the two .csv files for the first and second data sets. The variable name for the enrollment statistic has been changed to enrollment. The resulting objects (edu01A and edu01B) are two lists containing two tibbles each (edu01A$countyData, edu01A$noncountyData, edu01B$countyData and edu01B$noncountyData).

edu01A <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv", z = "enrollment")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010187N1, EDU010187N2, EDU010188N1, EDU010188...
## dbl (20): EDU010187F, EDU010187D, EDU010188F, EDU010188D, EDU010189F, EDU010...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
edu01B <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv", z = "enrollment")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, EDU010197N1, EDU010197N2, EDU010198N1, EDU010198...
## dbl (20): EDU010197F, EDU010197D, EDU010198F, EDU010198D, EDU010199F, EDU010...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Combine Function

Running the code chunk below calls the combine function combineEnrollment to combine edu01A and edu01B into one list (edu01AB) containing two tibbles corresponding to county level data (countyData) and non-county level data (noncountyData).

edu01AB <- combineEnrollment(edu01A, edu01B)
edu01AB
## $countyData
## # A tibble: 62,900 × 6
##    area_name   STCOU measurementType schoolYear enrollment state
##    <chr>       <chr> <chr>                <dbl>      <dbl> <chr>
##  1 Autauga, AL 01001 EDU0101               1987       6829 AL   
##  2 Autauga, AL 01001 EDU0101               1988       6900 AL   
##  3 Autauga, AL 01001 EDU0101               1989       6920 AL   
##  4 Autauga, AL 01001 EDU0101               1990       6847 AL   
##  5 Autauga, AL 01001 EDU0101               1991       7008 AL   
##  6 Autauga, AL 01001 EDU0101               1992       7137 AL   
##  7 Autauga, AL 01001 EDU0101               1993       7152 AL   
##  8 Autauga, AL 01001 EDU0101               1994       7381 AL   
##  9 Autauga, AL 01001 EDU0101               1995       7568 AL   
## 10 Autauga, AL 01001 EDU0101               1996       7834 AL   
## # … with 62,890 more rows
## 
## $noncountyData
## # A tibble: 1,060 × 6
##    area_name     STCOU measurementType schoolYear enrollment division
##    <chr>         <chr> <chr>                <dbl>      <dbl> <chr>   
##  1 UNITED STATES 00000 EDU0101               1987   40024299 ERROR   
##  2 UNITED STATES 00000 EDU0101               1988   39967624 ERROR   
##  3 UNITED STATES 00000 EDU0101               1989   40317775 ERROR   
##  4 UNITED STATES 00000 EDU0101               1990   40737600 ERROR   
##  5 UNITED STATES 00000 EDU0101               1991   41385442 ERROR   
##  6 UNITED STATES 00000 EDU0101               1992   42088151 ERROR   
##  7 UNITED STATES 00000 EDU0101               1993   42724710 ERROR   
##  8 UNITED STATES 00000 EDU0101               1994   43369917 ERROR   
##  9 UNITED STATES 00000 EDU0101               1995   43993459 ERROR   
## 10 UNITED STATES 00000 EDU0101               1996   44715737 ERROR   
## # … with 1,050 more rows

Custom Plot Function for State Class

Running the code chunk below calls the custom plot.state function to plot a line graph of the mean of the enrollment statistic (avgEnrollment) for each division per year observed (schoolYear).

plot(edu01AB$noncountyData, z = "enrollment")
## `summarise()` has grouped output by 'division'. You can override using the
## `.groups` argument.
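
Calling plot() reaches plot.state because of S3 dispatch: a generic function looks at the classes of its first argument, in order, and runs the first matching method. The base R sketch below shows the mechanism with a hypothetical generic named describe and made-up objects.

```r
# Hypothetical generic and methods, used only to show how dispatch works
describe <- function(x, ...) UseMethod("describe")
describe.state   <- function(x, ...) "state method"
describe.county  <- function(x, ...) "county method"
describe.default <- function(x, ...) "default method"

state_obj <- data.frame(area_name = "ALABAMA")
class(state_obj) <- c("state", class(state_obj))

county_obj <- data.frame(area_name = "Autauga, AL")
class(county_obj) <- c("county", class(county_obj))

describe(state_obj)    # first class is "state", so describe.state runs
describe(county_obj)   # first class is "county", so describe.county runs
describe(1:3)          # no matching class, so describe.default runs
```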

Custom Plot Function for County Class

User Specifications 1

Running the code chunk below calls the custom plot.county function to plot a line graph of the enrollment statistic (enrollment) for the 7 (n) counties (area_name) with the largest (org) enrollment values in Pennsylvania (st) across the years (schoolYear).

plot(edu01AB$countyData, z = "enrollment", st = "PA", org = "top", n = 7)
## Selecting by avgEnrollment

User Specifications 2

Running the code chunk below calls the custom plot.county function to plot a line graph of the enrollment statistic (enrollment) for the 4 (n) counties (area_name) with the smallest (org) enrollment values in Pennsylvania (st) across the years (schoolYear).

plot(edu01AB$countyData, z = "enrollment", st = "PA", org = "bottom", n = 4)
## Selecting by avgEnrollment

Default Specifications 1

Running the code chunk below calls the custom plot.county function to plot a line graph of the enrollment statistic (enrollment) for the 5 (n) counties (area_name) with the largest (org) enrollment values in Illinois (st) across the years (schoolYear).

plot(edu01AB$countyData, z = "enrollment")
## Selecting by avgEnrollment

User Specifications 3

Running the code chunk below calls the custom plot.county function to plot a line graph of the enrollment statistic (enrollment) for the 10 (n) counties (area_name) with the largest (org) enrollment values in Minnesota (st) across the years (schoolYear).

plot(edu01AB$countyData, z = "enrollment", st = "MN", org = "top", n = 10)
## Selecting by avgEnrollment

Data Sets 3, 4, 5 and 6

Wrapper Function

Running the code chunk below calls the wrapper function my_wrapper to read in and parse the four .csv files for the last four data sets. The resulting objects (pstA, pstB, pstC, and pstD) are four lists containing two tibbles each (pstA$countyData, pstA$noncountyData, pstB$countyData, pstB$noncountyData, pstC$countyData, pstC$noncountyData, pstD$countyData and pstD$noncountyData).

pstA <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/PST01a.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, PST015171N1, PST015171N2, PST015172N1, PST015172...
## dbl (20): PST015171F, PST015171D, PST015172F, PST015172D, PST015173F, PST015...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pstB <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/PST01b.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, PST025182N1, PST025182N2, PST025183N1, PST025183...
## dbl (20): PST025182F, PST025182D, PST025183F, PST025183D, PST025184F, PST025...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pstC <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/PST01c.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, PST035191N1, PST035191N2, PST035192N1, PST035192...
## dbl (20): PST035191F, PST035191D, PST035192F, PST035192D, PST035193F, PST035...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pstD <- my_wrapper("https://www4.stat.ncsu.edu/~online/datasets/PST01d.csv")
## Rows: 3198 Columns: 42
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Area_name, STCOU, PST045200N1, PST045200N2, PST045201N1, PST045201...
## dbl (20): PST045200F, PST045200D, PST045201F, PST045201D, PST045202F, PST045...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
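
The four calls above follow the same pattern, so they could also be written as one loop over a named vector of URLs. The sketch below shows the idea without downloading anything. The function fake_wrapper is a stand-in that only mimics the shape of the list returned by my_wrapper; it is an assumption for illustration, not the project's function.

```r
# Stand-in for my_wrapper(): returns a list with the same two element names,
# but does not read the file (no network access needed for this sketch)
fake_wrapper <- function(url) {
  list(
    countyData    = data.frame(source = url),
    noncountyData = data.frame(source = url)
  )
}

# Named vector of the four file locations used above
urls <- c(
  pstA = "https://www4.stat.ncsu.edu/~online/datasets/PST01a.csv",
  pstB = "https://www4.stat.ncsu.edu/~online/datasets/PST01b.csv",
  pstC = "https://www4.stat.ncsu.edu/~online/datasets/PST01c.csv",
  pstD = "https://www4.stat.ncsu.edu/~online/datasets/PST01d.csv"
)

# lapply() keeps the names, so the result is a list with elements pstA..pstD
pst <- lapply(urls, fake_wrapper)
names(pst)
```

Replacing fake_wrapper with my_wrapper would give pst$pstA, pst$pstB, and so on, in place of four separate objects.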

Combine Function

Running the code chunk below calls the combine function combineEnrollment three times to combine pstA, pstB, pstC, and pstD into one list (pstABCD) containing two tibbles corresponding to county level data (countyData) and non-county level data (noncountyData).

pstAB <- combineEnrollment(pstA, pstB)
pstCD <- combineEnrollment(pstC, pstD)
pstABCD <- combineEnrollment(pstAB, pstCD)
pstABCD
## $countyData
## # A tibble: 125,800 × 6
##    area_name   STCOU measurementType schoolYear studentsEnrolled state
##    <chr>       <chr> <chr>                <dbl>            <dbl> <chr>
##  1 Autauga, AL 01001 PST0151               1971            25508 AL   
##  2 Autauga, AL 01001 PST0151               1972            27166 AL   
##  3 Autauga, AL 01001 PST0151               1973            28463 AL   
##  4 Autauga, AL 01001 PST0151               1974            29266 AL   
##  5 Autauga, AL 01001 PST0151               1975            29718 AL   
##  6 Autauga, AL 01001 PST0151               1976            29896 AL   
##  7 Autauga, AL 01001 PST0151               1977            30462 AL   
##  8 Autauga, AL 01001 PST0151               1978            30882 AL   
##  9 Autauga, AL 01001 PST0151               1979            32055 AL   
## 10 Autauga, AL 01001 PST0251               1981            31985 AL   
## # … with 125,790 more rows
## 
## $noncountyData
## # A tibble: 2,120 × 6
##    area_name     STCOU measurementType schoolYear studentsEnrolled division
##    <chr>         <chr> <chr>                <dbl>            <dbl> <chr>   
##  1 UNITED STATES 00000 PST0151               1971        206827028 ERROR   
##  2 UNITED STATES 00000 PST0151               1972        209283904 ERROR   
##  3 UNITED STATES 00000 PST0151               1973        211357490 ERROR   
##  4 UNITED STATES 00000 PST0151               1974        213341552 ERROR   
##  5 UNITED STATES 00000 PST0151               1975        215465246 ERROR   
##  6 UNITED STATES 00000 PST0151               1976        217562728 ERROR   
##  7 UNITED STATES 00000 PST0151               1977        219759860 ERROR   
##  8 UNITED STATES 00000 PST0151               1978        222095080 ERROR   
##  9 UNITED STATES 00000 PST0151               1979        224567234 ERROR   
## 10 UNITED STATES 00000 PST0251               1981        229466391 ERROR   
## # … with 2,110 more rows
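
The three pairwise calls above can also be expressed as a single fold over a list of pieces with base R's Reduce(). The sketch below uses toy data and a stand-in combine function. Both combine_sketch and make_piece are hypothetical; in particular, the assumption that combining means row-binding the matching tibbles may not match combineEnrollment exactly.

```r
# Stand-in for combineEnrollment(): row-binds the matching elements of two lists
combine_sketch <- function(x, y) {
  list(
    countyData    = rbind(x$countyData, y$countyData),
    noncountyData = rbind(x$noncountyData, y$noncountyData)
  )
}

# Hypothetical helper that builds one tiny two-element list per year
make_piece <- function(year) {
  list(
    countyData    = data.frame(schoolYear = year, studentsEnrolled = year * 10),
    noncountyData = data.frame(schoolYear = year, studentsEnrolled = year * 100)
  )
}

pieces <- lapply(c(1971, 1981, 1991, 2000), make_piece)

# Reduce() applies combine_sketch left to right:
# combine(combine(combine(p1, p2), p3), p4)
combined <- Reduce(combine_sketch, pieces)
nrow(combined$countyData)
```

The fold combines the pieces strictly left to right rather than in two pairs, but for row-binding the resulting rows and their order are the same.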

Custom Plot Function for State Class

Running the code chunk below calls the custom plot.state function to plot a line graph of the mean of the enrollment statistic (avgEnrollment) for each division per year observed (schoolYear) in the pstABCD$noncountyData tibble.

plot(pstABCD$noncountyData)
## `summarise()` has grouped output by 'division'. You can override using the
## `.groups` argument.
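
The summarise() message above appears because the data are grouped by two variables and dplyr reports which grouping remains afterwards. The sketch below reproduces the averaging step on made-up data and passes .groups = "drop" so the result is ungrouped and no message is printed. The toy values and the step that drops rows whose division is "ERROR" are assumptions for illustration, not the project's actual plot.state code.

```r
library(dplyr)

# Toy non-county data (made-up values); "ERROR" marks a row with no division,
# as seen for UNITED STATES in the output above
toy <- tibble(
  division         = c("Pacific", "Pacific", "Mountain", "Mountain",
                       "Pacific", "Mountain", "ERROR"),
  schoolYear       = c(1971, 1971, 1971, 1971, 1972, 1972, 1971),
  studentsEnrolled = c(10, 20, 30, 50, 40, 60, 999)
)

avg <- toy %>%
  filter(division != "ERROR") %>%          # assumed: rows without a division are excluded
  group_by(division, schoolYear) %>%
  summarise(avgEnrollment = mean(studentsEnrolled), .groups = "drop")

avg
```

Each row of avg is one division in one year, which is the shape needed for a line graph with schoolYear on the x-axis and one line per division.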

Custom Plot Function for County Class

User Specifications 4

Running the code chunk below calls the custom plot.county function to plot a line graph of the enrollment statistic (studentsEnrolled) across the years (schoolYear) for the 6 (n) counties (area_name) in Connecticut (st) with the largest (org) mean enrollment (avgEnrollment).

plot(pstABCD$countyData, st = "CT", org = "top", n = 6)
## Selecting by avgEnrollment

User Specifications 5

Running the code chunk below calls the custom plot.county function to plot a line graph of the enrollment statistic (studentsEnrolled) across the years (schoolYear) for the 10 (n) counties (area_name) in North Carolina (st) with the smallest (org) mean enrollment (avgEnrollment).

plot(pstABCD$countyData, st = "NC", org = "bottom", n = 10)
## Selecting by avgEnrollment

Default Specifications 2

Running the code chunk below calls the custom plot.county function with no optional arguments, so every setting falls back to its default: the line graph shows the enrollment statistic (studentsEnrolled) across the years (schoolYear) for the 5 (n) counties (area_name) in Illinois (st) with the largest (org) mean enrollment (avgEnrollment).

plot(pstABCD$countyData)
## Selecting by avgEnrollment

User Specifications 6

Running the code chunk below calls the custom plot.county function to plot a line graph of the enrollment statistic (studentsEnrolled) across the years (schoolYear) for the 4 (n) counties (area_name) in Minnesota (st) with the largest mean enrollment (avgEnrollment). The org argument is not supplied, so it keeps its default of selecting the largest values.

plot(pstABCD$countyData, st = "MN", n = 4)
## Selecting by avgEnrollment